Key Factors Driving Parents to Immunize Their Children in India: A Preliminary Study
by Mochammad Miftahul Fahmi (5972035)
Hello, everyone! I hope you're having a beautiful day.
My name is Miftahul Fahmi, and I am currently studying Management of Technology at Delft University of Technology. Here are some related assignments from my data science courses, which I took from November 2023 to April 2024, so there will be more code posted here.
Seriously, why do I need to post my code here? The answer is simple:
Please, enjoy!
WHAT WILL WE DO HERE?
This code here will focus on Geographical Visualization and Analysis.
The steps:
In this part, main activities are:
# Declaration of Used Library
# For dataframe processing (shp and csv)
import pandas as pd
import geopandas as gpd
import numpy as np
from scipy.stats import pearsonr
from libpysal.weights import Queen
# For mapping geographical features
import osmnx as ox
import seaborn as sns
import contextily as ctx
from pysal.lib import weights
from pysal.explore import esda
from splot.esda import moran_scatterplot, lisa_cluster, plot_local_autocorrelation
# For plotting
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
from shapely.geometry import Point
%matplotlib inline
import mapclassify
Read the CSV file using read_csv. The CSV was downloaded from the open-source Mission Antyodaya Village Facilities dataset in SHRUG (https://www.devdatalab.org/shrug_download/), where further information can also be found.
The analysis will focus on the "Subdistrict" and "District" levels for several reasons (explained in the Reflection at the bottom of Part 01).
# Read CSV file, at desired level
initial_data = pd.read_csv("./data/shrug-antyodaya-csv/antyodaya_pc11subdist.csv")
# The content of the CSV will be explained on Part 02
# Check the data at a glimpse using head
initial_data.head()
| | pc11_state_id | pc11_district_id | pc11_subdistrict_id | total_population | male_population | female_population | total_hhd | total_hhd_engaged_in_farm_activi | total_hhd_engaged_in_non_farm_ac | is_govt_seed_centre_available | ... | open_uncovered_drainage | open_kuchha_drainage | fpo | no_electricity | internal_pucca_road | livestock_ext_services | _mean_p_miss | _core_p_miss | _target_weight_share | _target_group_max_weight_share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 41 | 225 | 778.0 | 407.0 | 371.0 | 155.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 1 | 12 | 245 | 1563 | 446.0 | 216.0 | 230.0 | 95.0 | 40.0 | 45.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 2 | 12 | 250 | 1626 | 603.0 | 253.0 | 350.0 | 129.0 | 110.0 | 3.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.45937 | 0.54063 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 12 | 253 | 1679 | 1380.0 | 828.0 | 552.0 | 228.0 | 120.0 | 70.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.00000 | 0.00000 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 4 | 21 | 371 | 2774 | 297.0 | 155.0 | 142.0 | 70.0 | 55.0 | 15.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.00000 | 1.00000 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
5 rows × 171 columns
Next, check the number of rows and columns, as well as the number of unique values in each ID column.
# Review the number of records (row) and columns
print("Check the Number of Rows and Columns")
num_rows, num_columns = initial_data.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")
Check the Number of Rows and Columns Number of rows: 5480 Number of columns: 171
# Get the name of the column (column title)
column_names = list(initial_data.columns)
print("The names of the columns of the data:")
for i, column_name in enumerate(column_names, start=1):
    print(f"{i}. {column_name}")
The name of the columns of the data: 1. pc11_state_id 2. pc11_district_id 3. pc11_subdistrict_id 4. total_population 5. male_population 6. female_population 7. total_hhd 8. total_hhd_engaged_in_farm_activi 9. total_hhd_engaged_in_non_farm_ac 10. is_govt_seed_centre_available 11. availability_of_watershed_dev_pr 12. availability_of_rain_harvest_sys 13. availability_of_food_storage_war 14. availability_of_farm_gate_proces 15. availability_of_custom_hiring_ce 16. total_cultivable_area_in_hac 17. net_sown_area_in_hac 18. net_sown_area_kharif_in_hac 19. net_sown_area_rabi_in_hac 20. net_sown_area_other_in_hac 21. is_soil_testing_centre_available 22. is_fertilizer_shop_available 23. no_of_farmers_using_drip_sprinkl 24. area_irrigated_in_hac 25. total_unirrigated_land_area_in_h 26. availability_of_milk_routes 27. availability_of_poultry_dev_proj 28. availability_of_goatary_dev_proj 29. availability_of_pigery_developme 30. is_veterinary_hospital_available 31. availability_of_fish_community_p 32. availability_of_fish_farming 33. availability_of_aquaculture_ext_ 34. total_hhd_with_kuccha_wall_kucch 35. total_hhd_have_got_pmay_house 36. total_hhd_in_pmay_permanent_wait 37. total_hhd_got_benefit_under_stat 38. total_hhd_in_perm_waitlist_under 39. piped_water_fully_covered 40. is_village_connected_to_all_weat 41. public_transport 42. availability_of_railway_station 43. total_hhd_having_pmsbhgy_benefit 44. availability_of_elect_supply_to_ 45. availability_of_solor_wind_energ 46. total_hhd_having_solor_wind_ener 47. availability_of_panchayat_bhawan 48. csc 49. public_info_board 50. is_common_pastures_available 51. total_hhd_availing_pmuy_benefits 52. availability_of_public_library 53. recreational_centre 54. is_bank_available 55. is_bank_buss_correspondent_with_ 56. is_atm_available 57. total_hhd_availing_pmjdy_bank_ac 58. is_post_office_available 59. telephone_services 60. is_broadband_available 61. is_pds_available 62. total_hhd_having_bpl_cards 63. 
availability_of_primary_school 64. is_primary_school_with_electrici 65. is_primary_school_with_computer_ 66. is_primary_school_with_playgroun 67. is_primary_school_have_drinking_ 68. availability_of_mid_day_meal_sch 69. total_primary_school_students 70. total_primary_school_teachers 71. availability_of_middle_school 72. availability_of_high_school 73. availability_of_ssc_school 74. no_of_children_not_attending_sch 75. availability_of_govt_degree_coll 76. total_grd_and_pg_in_village 77. is_vocational_edu_centre_availab 78. total_trained_trainee_under_skil 79. availability_of_jan_aushadhi_ken 80. total_hhd_registered_under_pmjay 81. is_community_waste_disposal_syst 82. total_hhd_with_clean_energy 83. is_community_biogas_waste_recycl 84. is_aanganwadi_centre_available 85. is_early_childhood_edu_provided_ 86. total_childs_aged_0_to_3_years 87. total_childs_aged_0_to_3_years_r 88. total_childs_aged_3_to_6_years_r 89. total_childs_aged_0_to_3_years_i 90. total_childs_categorized_non_stu 91. total_anemic_pregnant_women 92. total_anemic_adolescent_girls 93. total_underweight_child_age_unde 94. total_male_child_age_bw_0_6 95. total_female_child_age_bw_0_6 96. total_minority_children_getting_ 97. total_minority_hh_provided_bank_ 98. no_of_physically_challenged_recv 99. total_hhd_with_more_than_2_child 100. availability_of_mother_child_hea 101. total_hhd_availing_pension_under 102. total_shg 103. total_hhd_mobilized_into_shg 104. total_no_of_shg_promoted 105. total_hhd_mobilized_into_pg 106. total_shg_accessed_bank_loans 107. is_bee_farming 108. is_sericulture 109. is_handloom 110. is_handicrafts 111. availability_of_community_forest 112. availability_of_minor_forest_pro 113. total_hhd_source_of_minor_forest 114. availability_of_cottage_small_sc 115. total_hhd_engaged_cottage_small_ 116. availability_of_adult_edu_centre 117. total_no_of_registered_children_ 118. total_no_of_children_0_to_6_year 119. total_no_of_pregnant_women 120. total_no_of_pregnant_women_recei 121. 
total_no_of_lactating_mothers 122. total_no_of_lactating_mothers_re 123. total_no_of_women_delivered_babi 124. total_no_of_children_in_icds_cas 125. total_no_of_young_anemic_childre 126. total_no_of_newly_born_children 127. total_no_of_newly_born_underweig 128. total_hhd_not_having_sanitary_la 129. total_no_of_eligible_beneficiari 130. total_no_of_beneficiaries_receiv 131. gp_total_no_of_eligible_benefici 132. gp_total_no_of_beneficiaries_rec 133. gp_total_hhd_eligible_under_nfsa 134. gp_total_hhd_receiving_food_grai 135. total_no_of_farmers_registered_u 136. total_no_of_farmers_subscribed_a 137. total_no_of_farmers 138. total_no_of_farmers_received_ben 139. total_no_farmers_adopted_organic 140. total_no_of_farmers_add_fert_in_ 141. total_no_of_elected_representati 142. total_no_of_elect_rep_oriented_u 143. total_no_of_elect_rep_undergone_ 144. total_approved_labour_budget_for 145. total_expenditure_approved_under 146. total_area_covered_under_irrigat 147. total_hhd_having_piped_water_con 148. pc11_land_area 149. canal 150. surface_water 151. ground_water 152. other_irrigation 153. any_primary_sch_toilet 154. mandi 155. regular_market 156. weekly_haat 157. phc 158. chc 159. sub_centre 160. closed_drainage 161. open_covered_drainage 162. open_uncovered_drainage 163. open_kuchha_drainage 164. fpo 165. no_electricity 166. internal_pucca_road 167. livestock_ext_services 168. _mean_p_miss 169. _core_p_miss 170. _target_weight_share 171. _target_group_max_weight_share
Reflection
Some important reflection based on the decision of which data will be the main set:
A. The main dataset chosen
The main dataset will be SHRUG at the subdistrict level, for some important reasons: 1) Regarding one of the goals of the research: to carry out a preliminary study and a case study to test the initial hypothesis. It also involves amenities (health-related facilities) that are inherently more granular (closer to the subdistrict level than state-wide). This will be explained further in Part 02 below.
2) Regarding the data analysis process:
Hence, I will use the "Subdistrict"-level CSV as the initial main data. For the SHP file, I will also use the "District"-level SHP, because the "Subdistrict"-level SHP results in 'many islands' during the creation of the Spatial Weight Matrix (this is part of the Non-Linear Process).
The shrid-level CSV is not chosen because of data availability: as the SHRUG Documentation shows, the shrid level has 11% missing data, while the subdistrict and district levels have only 7% and 4% respectively. Subdistrict-level data may also be more readily available and accurate than finer levels such as villages and towns (shrid).
On the other hand, the state level is too broad. State-level data is frequently used for policy advocacy and reporting to higher authorities, but it lacks the nuance needed for an amenities case study.
B. Inclusivity and Representativeness
1) Number of missing data. As explained above, it is wiser to have less missing data in a dataset (assuming the number of represented districts is equal). The share of missing data at the shrid level is above 10%, higher than in the other available datasets.
2) The chosen main dataset: the subdistrict level contains no missing total_population data across 5444 subdistricts and 613 districts. Based on reference [1], the data covers 5444 out of 6057 subdistricts (89%, 2018 data) and 613 out of 808 districts (76%, 2022 data), which already exceeds 75% of the area of India. See the calculation below.
C. Other Reflections using CDS Framework
[1] https://en.wikipedia.org/wiki/Administrative_divisions_of_India Note: data from the same year was hard to retrieve. Hence, I provide 2018 and 2022 census data.
# Here, we will calculate the number of unique values in the subdistrict and district columns.
# However, we first need to drop rows where total_population is NA,
# because, in my opinion, total population cannot easily be replaced by a single value such as the mean.
# It would make more sense to replace it with density x total area in each district or subdistrict,
# but that is high effort when we only need a first glimpse of the data.
def calculate_num_unique_values(df, column):
    # Filter out rows where 'total_population' is missing
    filtered_df = df.dropna(subset=['total_population'])
    # Calculate the number of unique values for the specified column
    num_unique_values = filtered_df[column].nunique()
    return num_unique_values
# Calculate the number of unique values for each specified column
num_unique_values_district = calculate_num_unique_values(initial_data, 'pc11_district_id')
num_unique_values_subdistrict = calculate_num_unique_values(initial_data, 'pc11_subdistrict_id')
# Print the number of unique values for each column
print("Number of unique values in pc11_district_id:", num_unique_values_district)
print("Number of unique values in pc11_subdistrict_id:", num_unique_values_subdistrict)
Number of unique values in pc11_district_id: 613 Number of unique values in pc11_subdistrict_id: 5444
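The coverage shares quoted in the Reflection above follow from these unique counts and the administrative totals taken from reference [1] (6057 subdistricts as of 2018, 808 districts as of 2022); a quick cross-check:

```python
# Covered counts come from the unique-value calculation above;
# totals are taken from reference [1] (Wikipedia).
subdistricts_covered, subdistricts_total = 5444, 6057
districts_covered, districts_total = 613, 808

subdistrict_share = subdistricts_covered / subdistricts_total * 100
district_share = districts_covered / districts_total * 100

print(f"Subdistrict coverage: {subdistrict_share:.1f}%")  # -> 89.9%
print(f"District coverage: {district_share:.1f}%")        # -> 75.9%
```

Both shares clear the >75% coverage threshold used to justify the dataset choice.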
In this part, main activities are:
Research Background
The COVID-19 pandemic reminds us of the importance of vaccines and immunization. Through immunization, our body develops an immune foundation, preventing severe illness and contributing to herd immunity. Hence, the earlier that foundation is built, the more we contribute to herd immunity. Therefore, immunization for children is important: it safeguards against various, potentially life-threatening diseases. Protecting this vulnerable age group is vital for long-term public health and the well-being of the community.
This research is essentially a preliminary study to identify factors driving the immunization level among children aged 0-3 years. As a starting point, several factors affect the immunization level, mainly economic, social, and political ones. Saggurti et al. (2018)[1] underpin that women from SHGs (Self Help Groups; community organizations that support families and women financially) with health interventions were more likely to follow healthier practices overall, including providing age-appropriate immunization.
Kumar (2009)[2] found that the educational level of women and their husbands, the level of income, and infrastructure all impact women's participation in SHGs. In this research, we will assume that these factors also impact the level of immunization.
Research Objective
This program is part of a preliminary study, an initial start (testing the waters) to better understand the driving factors impacting the immunization level of children aged 0-3 years. In the end, this preliminary research will narrow down which variables are likely to influence the level of immunization, allowing future research to focus on the important variables (while acknowledging that preliminary findings may require further investigation).
Scope of the Data
The SHRUG Mission Antyodaya Village Facilities data contains a large dataset at the shrid, subdistrict, and district levels of important factors related to facilities and socio-economic features of India, collected in 2018-2020.
Some important sources:
[1] Saggurti, N., Atmavilas, Y., Porwal, A., Schooley, J., Das, R., Kande, N., Irani, L., & Hay, K. (2018). Effect of health intervention integration within women's self-help groups on collectivization and healthy practices around reproductive, maternal, neonatal and child health in rural India. PloS one, 13(8), e0202562. https://doi.org/10.1371/journal.pone.0202562
[2] Participation in Self-Help Group Activities and its Impacts: Evidence from South India. MPRA Paper No. 19943, posted 14 Jan 2010 (online at https://mpra.ub.uni-muenchen.de/19943/).
Hypothesis
To give a clear picture of the hypothesis, see diagram below

H1: The number of graduates and postgraduates (representing the education level) positively impacts the level of immunized children (age 0-3 years).
H2: The number of Self Help Groups (SHGs) (representing community help) positively impacts the level of immunized children (age 0-3 years).
H3: The number of Integrated Child Development Services (ICDS) beneficiaries (representing subsidy) positively impacts the level of immunized children (age 0-3 years).
H4: The number of health facilities in an area (representing infrastructure) positively impacts the level of immunized children (age 0-3 years).
H5: The level of immunized children (age 0-3 years) is affected by neighbouring areas that also have a certain level of immunization.
Hypothesis Explanation
H1: The number of graduates and postgraduates (representing the education level of parents) positively impacts the level of immunized children (age 0-3 years). A higher education level improves cognitive ability and science-consciousness of the fact that immunization strengthens the body's immune system (acknowledging that this is science-based evidence). It is expected that without prior knowledge of the benefits of immunization, people will be less willing to provide it to their children.
H2: The number of Self Help Groups (SHGs) (representing community help) positively impacts the level of immunized children (age 0-3 years). Self Help Groups contribute to community well-being, potentially boosting child immunization rates (age 0-3) through shared resources, awareness, and support. It is expected that if a group of people immunize their children, the willingness of others to provide immunization will also increase.
H3: The number of Integrated Child Development Services (ICDS) beneficiaries (representing subsidy) positively impacts the level of immunized children (age 0-3 years). This hypothesis builds on the idea that subsidies offered through ICDS contribute to improved accessibility, affordability, and awareness of healthcare services, including immunization.
H4: The number of health facilities in an area (representing infrastructure) positively impacts the level of immunized children (age 0-3 years). This hypothesis arises because I think a well-developed healthcare infrastructure facilitates increased immunization coverage. Areas with more health facilities are likely to provide better access to vaccination services, reducing barriers and encouraging parents to adhere to immunization schedules.
H5: The level of immunized children (age 0-3 years) is affected by neighbouring areas that also have a certain level of immunization. The immunization practices of one area (subdistrict) may impact the immunization behavior of nearby areas. This stems from the idea that communities with higher immunization rates create a positive norm that influences neighboring communities.
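H5 is, in essence, a spatial-autocorrelation hypothesis, which is typically tested with Moran's I. As a rough, self-contained sketch of what that statistic measures, it can be computed by hand with NumPy on a toy chain of four areas (the values and adjacency below are made up for illustration; the real analysis uses the pysal/esda stack on the SHRUG data):

```python
import numpy as np

# Toy data: 4 areas in a chain (0-1-2-3) with immunization levels that
# increase along the chain (illustrative values, not SHRUG data)
x = np.array([1.0, 2.0, 3.0, 4.0])

# Binary contiguity: each area neighbours the next one in the chain
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

n = len(x)
z = x - x.mean()                     # deviations from the mean
s0 = W.sum()                         # sum of all weights
morans_i = (n / s0) * (z @ W @ z) / (z @ z)

print(round(morans_i, 4))  # -> 0.3333, i.e. positive spatial autocorrelation
```

A positive value means similar immunization levels cluster in neighbouring areas, which is exactly what H5 claims.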
Hypothesis Testing Method
To improve inclusivity, H4 will focus on three districts with high, average, and low immunization levels among children aged 0-3 years.
# Constructs or Variables
Variables that are investigated here, and their explanation:

| Variables | Represented by | Meaning | Unit | Type |
|----------|----------|----------|----------|----------|
| The number of graduates and post graduates (per capita) | total_grd_and_pg_in_village divided by total_population | Attainment of a high level of education (represents the awareness of parenting and health) per capita | % | independent var |
| The number of SHGs (per capita) | total_shg divided by total_population | SHG: Self Help Group, a bottom-up community organization (NGO) with the goal of empowering women financially and in leadership (parenting) | per capita | independent var |
| The number of ICDS beneficiaries (lactating women) per capita of lactating women | total_no_of_lactating_mothers_re divided by total_no_of_lactating_mothers | ICDS: Integrated Child Development Services | % | independent var |
| Total number of health facilities per area (per district) | calculated using a GeoDataFrame | The amenities will be fetched from OSMnx --> plotted --> the number of health facilities belonging to each district counted after mapping | no. per km2 | independent var |
| The number of immunized children (per capita of the number of children 0-3 years) | total_childs_aged_0_to_3_years_i divided by total_childs_aged_0_to_3_years | Percentage of children (age 0-3 yrs) with immunization | % | dependent var |
Reasons of choosing the variables
Most of the selection of the variables in this preliminary research are based on these factors:
Explanation of each of the variables (nos. 1-4 are independent (as is no. 6); no. 5 is dependent)
Inclusion is beneficial: in this research, we can see some areas with lower immunization levels. Thus, I will choose districts (and the subdistricts inside them) that represent areas with high, average, and low immunization levels. This is possible because of the benefits of analyzing data from various districts, and this inclusion leads to more representative data on all variables (areas with high/low subsidy, etc.).
Reflection
A. Possible Other Variables that will improve the OUTPUT of RESEARCH.
Factor: Education Level. Alternative variables:
Factor: Economy. Alternative variables:
Factor: Health Facilities. Alternative variables:
B. Possible Rejecting Condition
The hypotheses explained above would be rejected under various conditions:
H1 --> Rejected if the education level does not align (low correlation) with the immunization level of children aged 0-3 years in India.
H2 --> Rejected if the presence of SHGs does not demonstrate a statistically significant correlation with increased immunization rates.
H3 --> Rejected if the communities with a higher number of ICDS beneficiaries do not exhibit a statistically significant increase in immunization rates
H4 --> Rejected if the presence of Health Facilities does not demonstrate a statistically significant correlation with increased immunization rates.
H5 --> Rejected if there is little correlation between an area and its neighbours (e.g. mostly high-immunization areas surrounding a low-immunization area).
C. Possible Bias
Possible biases are explained below:
C.1 Possible cases in which the hypothesis will be falsely rejected
H2 --> Rejected, but it should not have been, because SHGs are heterogeneous (their dynamics around immunization differ across communities) and the analysis does not account for this variability.
H4 --> Rejected, but falsely so, because certain areas are sparsely settled (more empty area) and thus have very few facilities!
C.2 Possible cases in which the hypothesis will be falsely accepted
H1 --> Accepted, but falsely so, because we cannot be sure whether the degrees are science- or health-related. There is also a proficiency caveat: having a degree does not necessarily mean the person has a sufficient cognitive grasp of the benefits of immunization.
H3 --> Accepted, but falsely so, because of other government subsidies directly tied to immunization that are combined with ICDS, for example if only ICDS beneficiaries can receive the immunization subsidy.
D. Final Reflection and LIMITATIONS OF THE RESEARCH!!
Important perspective that I am not taking into account:
Some important Critical Data Science framework:
E. What I did differently this time (compared to the previous assignment):
In this part, main activities are:
1. Data Subsetting
Since the variables have already been defined in Part 02, we will only take the columns related to the hypotheses (see the table in Part 02). However, there are also important columns that we should not exclude, for example the IDs of subdistrict and district (used for merging with the SHP files) and population (which may be used to normalize the numbers / obtain percentages).
selected_columns = ['pc11_state_id','pc11_district_id','pc11_subdistrict_id',
'total_population','male_population','female_population','total_hhd',
'total_grd_and_pg_in_village','total_shg',
'total_no_of_lactating_mothers','total_no_of_lactating_mothers_re',
'total_childs_aged_0_to_3_years','total_childs_aged_0_to_3_years_i'
]
# Select only the columns in the 'selected_columns' array
subset_df = initial_data[selected_columns]
2. Manage Missing Data
First, we need to see which indices contain NaN.
# Basically do a "for loop" and check every index in each column: is there any NA?
for x in selected_columns:
    if subset_df[x].isna().any():
        na_loc = subset_df.index[subset_df[x].isna()].tolist()
        print(f"{x} with total {len(na_loc)} NaN data ({round(len(na_loc)/len(subset_df[x])*100, ndigits=2)}%), at index {na_loc}")
total_population with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413] male_population with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413] female_population with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413] total_hhd with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413] total_grd_and_pg_in_village with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413] total_shg with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413] total_no_of_lactating_mothers with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413] total_no_of_lactating_mothers_re with total 36 NaN data (0.66%), at index [62, 288, 
1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413] total_childs_aged_0_to_3_years with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413] total_childs_aged_0_to_3_years_i with total 36 NaN data (0.66%), at index [62, 288, 1526, 1560, 1561, 1580, 1793, 3540, 4131, 4149, 4167, 4175, 4274, 4307, 4313, 4322, 4325, 4326, 4327, 4328, 4329, 4339, 4354, 4356, 4543, 4680, 4704, 4719, 4720, 4800, 4844, 4859, 4883, 4929, 5115, 5413]
We can see that the NA values occur at the same 36 indices in every column. Then, we need to use our judgment to decide whether to delete or replace the NA values.
Here are some ideas that I use for this case:
# Remove NA value of point 1 and 2 above
columns_to_remove_na = ['total_shg', 'total_population']
# Drop rows with NA values in those columns
subset_df = subset_df.dropna(subset=columns_to_remove_na)
Also, since we will create new per-capita columns, it is possible to get 0/0 values. We therefore need to remove the rows that would cause division by zero.
columns_to_remove_zero = ['total_no_of_lactating_mothers']
# Drop rows where specified columns contain only 0
subset_df = subset_df.loc[~(subset_df[columns_to_remove_zero] == 0).all(axis=1)]
# Do the same for total children aged 0-3 yrs
columns_to_remove_zero = ['total_childs_aged_0_to_3_years']
subset_df = subset_df.loc[~(subset_df[columns_to_remove_zero] == 0).all(axis=1)]
After the NA data is gone, we can do some calculations (adding important variables). We need to create new variables so that the comparison is more inclusive and fair (otherwise, the number of immunized children could be higher in a certain area simply because it has more children!).
List of new variables:
# No 01: percentage of immunized children (age 0-3)
subset_df['percent_immunized_child'] = (subset_df['total_childs_aged_0_to_3_years_i'] / subset_df['total_childs_aged_0_to_3_years'] * 100).round(1)
# No 02: graduates and postgraduates per capita (%)
subset_df['percent_grd'] = (subset_df['total_grd_and_pg_in_village'] / subset_df['total_population'] * 100).round(1)
# No 03: SHGs per capita
subset_df['percent_shg'] = (subset_df['total_shg'] / subset_df['total_population']).round(5)
# No 04: share of lactating mothers receiving ICDS (%)
subset_df['percent_icds_women'] = (subset_df['total_no_of_lactating_mothers_re'] / subset_df['total_no_of_lactating_mothers'] * 100).round(1)
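A useful sanity check on derived ratio columns like these is that the percentages must land in [0, 100]; a minimal sketch on a tiny hypothetical frame (not the real `subset_df`):

```python
import pandas as pd

# Tiny hypothetical frame mirroring the computation above (not SHRUG data)
demo = pd.DataFrame({
    'total_childs_aged_0_to_3_years_i': [40, 90],
    'total_childs_aged_0_to_3_years': [50, 100],
})
demo['percent_immunized_child'] = (
    demo['total_childs_aged_0_to_3_years_i']
    / demo['total_childs_aged_0_to_3_years'] * 100
).round(1)

# Counts of immunized children can never exceed the number of children,
# so the percentage must land in [0, 100].
assert demo['percent_immunized_child'].between(0, 100).all()
print(demo['percent_immunized_child'].tolist())  # -> [80.0, 90.0]
```

A value outside that range would point to a data-entry error in the numerator or denominator column.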
Finally, check whether any NaN values remain.
# Recheck if there is NA value again
column_where_na = []
row_where_na = []
# Select all columns inside the subset_df DataFrame
all_columns = subset_df.columns
# For each column, check whether it contains NA; if so, record the column and rows in the arrays above.
for col in all_columns:
    if subset_df[col].isna().any():
        na_loc = subset_df.index[subset_df[col].isna()].tolist()
        print(f"{col}: NaN at index {na_loc}")
        column_where_na.append(col)
        row_where_na.extend(na_loc)
if not column_where_na:
    print("No NaN values found. Great!")
else:
    print("There are NaN values. Please check.")
    print("Columns with NaN values:", column_where_na)
    print("Rows with NaN values:", row_where_na)
No NaN values found. Great!
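For future reruns, the same "no NaN left" check can be compressed into a single pandas expression; a small sketch on a hypothetical frame (not the real `subset_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one remaining NaN
demo = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, np.nan]})

# isna().any() flags the columns that contain NaN; the second .any()
# collapses that to a single boolean for the whole frame.
assert demo.isna().any().any()        # 'b' still has a NaN

demo = demo.dropna()
assert not demo.isna().any().any()    # clean after dropping
print("No NaN values found. Great!")
```

This one-liner gives the same pass/fail answer as the loop, though the loop is still useful when you want the exact offending indices.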
3. Initial Visualization
AT FIRST, we will focus on the absolute numbers (not the per-capita numbers) of the variables to get a global overview.
Histograms will be used to see the profile of each variable, mainly how centralized the data is. This is also beneficial at the "Quantiles" stage later.
# I defined which column that I will plot in histogram
columns_hist = [
'total_grd_and_pg_in_village',
'total_shg',
'total_no_of_lactating_mothers_re',
'total_childs_aged_0_to_3_years_i'
]
# Then, I set the number of columns and rows for the layout
num_cols = 2
num_rows = 2
# Set up subplots for each column
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(10, 6))
axes = axes.flatten()
# Create histograms in each of the columns
for i, column in enumerate(columns_hist):
axes[i].hist(subset_df[column].dropna(), bins=20, color='blue', alpha=0.7)
axes[i].set_xlabel('Value')
axes[i].set_ylabel('Frequency')
if column == 'total_grd_and_pg_in_village':
axes[i].set_title('Number of Graduates and Postgrads')
axes[i].set_xlim(0, 50000) # Set x-limit from 0 to 50000
if column == 'total_shg':
axes[i].set_title('Number of SHGs')
axes[i].set_xlim(0, 8000) # Set x-limit from 0 to 8000
if column == 'total_no_of_lactating_mothers_re':
axes[i].set_title('Number of Lactating Mothers Receiving ICDS')
axes[i].set_xlim(0, 12000) # Set x-limit from 0 to 12000
if column == 'total_childs_aged_0_to_3_years_i':
axes[i].set_title('Number of Children Age 0-3 Years with Immunization')
axes[i].set_xlim(0, 40000) # Set x-limit from 0 to 40000
# Add a single title for all subplots
fig.suptitle('Histogram of Socioeconomic Key Variables in India', fontsize=16)
plt.tight_layout()
plt.show()
Graphical Excellence
Next, we want to test hypotheses H1, H2, and H3. To do so, we use the Pearson correlation and a heat map. This also verifies that a correlation actually exists between the independent and dependent variables of each hypothesis.
The plots are completed with a linear regression line and a confidence interval to show the spread of the data.
# Declare which columns we want to correlate
columns_corr = [
'total_childs_aged_0_to_3_years_i',
'total_grd_and_pg_in_village',
'total_shg',
'total_no_of_lactating_mothers_re'
]
# Create a subset dataframe with only the columns specified above
# (.copy() avoids a SettingWithCopyWarning when renaming below)
subset_corr_df = subset_df[columns_corr].copy()
# Rename the columns so that they are readable by HUMANS!
subset_corr_df.columns = ['No. of Childs (0-3) Immunized','Education Level', 'No. of SHGs', 'No. of Lact. Mothers with ICDS']
# Create a pair plot
pair_plot = sns.pairplot(subset_corr_df, kind='scatter', diag_kind='hist')
# Change the colour to black (scatter plot)
for i, j in zip(*np.triu_indices_from(pair_plot.axes, 1)):
sns.scatterplot(x=subset_corr_df.columns[j], y=subset_corr_df.columns[i], data=subset_corr_df, color='black', ax=pair_plot.axes[i, j])
# Add orange linear regression lines with confidence intervals
for i, j in zip(*np.triu_indices_from(pair_plot.axes, 1)):
sns.regplot(x=subset_corr_df.columns[j], y=subset_corr_df.columns[i], data=subset_corr_df, scatter=False, color='orange', ax=pair_plot.axes[i, j])
# Add title
pair_plot.fig.suptitle('Pair Plot of Key Variables regarding Immunization Level of Childs (0-3) in India', y=1.02, fontsize=16)
# Show the plot
plt.show()
Graphical Excellence
From the pair plot above, we can observe that the absolute (not per-capita) number of children (0-3 years) immunized is correlated with the variables Education Level, No. of SHGs, and No. of Lactating Mothers Receiving ICDS. To make this clearer, we can also look at the heat map.
# Create a heatmap of the correlation matrix
correlation_matrix = subset_corr_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
# Show the plot
plt.title('Heatmap of Socio-economic Variables related to Immunization of Childs (0-3) in India', fontsize=12)
plt.show()
Graph Excellence
Some important points that we get from the heat map above:
We found that some independent variables are correlated with other independent variables (so they are not excellent candidates for independent variables). Next, we address the hypotheses more directly.
Let us check all the variables in H1, H2, and H3, this time as per-capita variables (a fairer, more inclusive comparison).
# Define which variables we want to observe
columns_name = [
'percent_immunized_child',
'percent_grd',
'percent_shg',
'percent_icds_women',
]
# Focus on dependent vs all independent variables on H1-H3
subset_corr_df = subset_df[columns_name].copy()
subset_corr_df.columns = ['No. of Childs (0-3) Immunized','Education Level', 'No. of SHGs', 'No. of Lact. Mothers with ICDS']
# Create a pair plot with 'percent_immunized_child' on the y-axis and others on the x-axis
pair_plot = sns.pairplot(subset_corr_df, y_vars=['No. of Childs (0-3) Immunized'], x_vars=subset_corr_df.columns[1:], kind='scatter', diag_kind='hist')
# Show the plot
pair_plot.fig.suptitle('Pair Plot of Key Variables regarding Immunization Level of Childs (0-3) in India', y=1.12, fontsize=12)
plt.show()
Principle of excellence:
# Create a heatmap of the correlation matrix
correlation_matrix = subset_corr_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
# Show the plot
plt.title('Heatmap of Socio-economic Variables (per capita) related to Immunization of Childs (0-3) in India', fontsize=12)
plt.show()
Graph Excellence
To confirm significance, we will also calculate p-values. A low p-value (typically under 5%) indicates a significant correlation between the variables. The null hypothesis of a correlation test is typically that there is no correlation between the variables.
# Calculate correlation matrix with p-values
corr_matrix = subset_corr_df.corr()
p_values = np.zeros_like(corr_matrix)
for i in range(corr_matrix.shape[0]):
for j in range(corr_matrix.shape[1]):
_, p_value = pearsonr(subset_corr_df.iloc[:, i], subset_corr_df.iloc[:, j])
p_values[i, j] = p_value
# Create a list of p-values
p_values_list = []
for i in range(corr_matrix.shape[0]):
for j in range(corr_matrix.shape[1]):
p_values_list.append((subset_corr_df.columns[i], subset_corr_df.columns[j], p_values[i, j]))
# Print the list of p-values
for p_value_info in p_values_list:
print(f"P-value between {p_value_info[0]} and {p_value_info[1]}: {p_value_info[2]:.3f}")
P-value between No. of Childs (0-3) Immunized and No. of Childs (0-3) Immunized: 0.000
P-value between No. of Childs (0-3) Immunized and Education Level: 0.000
P-value between No. of Childs (0-3) Immunized and No. of SHGs: 0.000
P-value between No. of Childs (0-3) Immunized and No. of Lact. Mothers with ICDS: 0.000
P-value between Education Level and No. of Childs (0-3) Immunized: 0.000
P-value between Education Level and Education Level: 0.000
P-value between Education Level and No. of SHGs: 0.000
P-value between Education Level and No. of Lact. Mothers with ICDS: 0.000
P-value between No. of SHGs and No. of Childs (0-3) Immunized: 0.000
P-value between No. of SHGs and Education Level: 0.000
P-value between No. of SHGs and No. of SHGs: 0.000
P-value between No. of SHGs and No. of Lact. Mothers with ICDS: 0.000
P-value between No. of Lact. Mothers with ICDS and No. of Childs (0-3) Immunized: 0.000
P-value between No. of Lact. Mothers with ICDS and Education Level: 0.000
P-value between No. of Lact. Mothers with ICDS and No. of SHGs: 0.000
P-value between No. of Lact. Mothers with ICDS and No. of Lact. Mothers with ICDS: 0.000
Reflection
Based on the provided p-values:
H1: Between "No. of Childs (0-3) Immunized" and "Education Level": p-value = 0.000 (significant)
H2: Between "No. of Childs (0-3) Immunized" and "No. of SHGs": p-value = 0.000 (significant)
H3: Between "No. of Childs (0-3) Immunized" and "No. of Lact. Mothers with ICDS": p-value = 0.000 (significant)
These small p-values support the idea that there is a statistically significant positive correlation between the number of immunized children (age 0-3 years) and each of the variables: "Education Level," "No. of SHGs," and "No. of Lact. Mothers with ICDS." This aligns with the positive impacts suggested by hypotheses H1, H2, and H3.
However, some independent variables are correlated with other independent variables! We will need to evaluate this in future research, because there is a possibility of bias (for example, a variable treated as independent might turn out to be dependent). For now, we will accept H1-H3 based on the significance above.
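One hedged way to quantify how strongly each independent variable is explained by the others (my own suggestion, not part of the original analysis) is the variance inflation factor (VIF). A minimal numpy/pandas sketch:

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2
    comes from regressing that column on all the other columns."""
    X = df.to_numpy(dtype=float)
    out = {}
    for i, col in enumerate(df.columns):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares fit
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)
```

As a rule of thumb, a VIF above roughly 5-10 signals problematic multicollinearity; the per-capita columns here (e.g. `percent_grd`, `percent_shg`, `percent_icds_women`) could be screened this way before any regression.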
Critical DS Framework Reflection
Let us discuss this further in the next preliminary research, because these cases need more expert input (my current knowledge is lacking, I need Trivik :) )
Next, we will try to test H4 and H5.
For H4, we need OSMnx AMENITIES to obtain the new variable "Number of Health Facilities in an Area"!
In this part, main activities are:
1. Merging shp and csv files
Here, we will merge the shp and csv files. The key indices that connect them are "pc11_subdistrict_id" and "pc11_district_id". Since our analysis focuses on the subdistrict level, the files below will be combined:
map_path = './data/shrug-pc11subdist-poly-gpkg/subdistrict.gpkg'
map = gpd.read_file(map_path)
gdf = map
# Merge Data of subset_dataframe (subset_df) with the geo data frame (gdf)
# But first, let's examine the type of the column
gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'> RangeIndex: 5969 entries, 0 to 5968 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 pc11_state_id 5969 non-null object 1 pc11_district_id 5969 non-null object 2 pc11_subdistrict_id 5969 non-null object 3 subdistrict_name 5966 non-null object 4 geometry 5969 non-null geometry dtypes: geometry(1), object(4) memory usage: 233.3+ KB
# and the subset_df
subset_df.info()
subset_df.columns
<class 'pandas.core.frame.DataFrame'> Index: 5420 entries, 1 to 5479 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 pc11_state_id 5420 non-null int64 1 pc11_district_id 5420 non-null int64 2 pc11_subdistrict_id 5420 non-null int64 3 total_population 5420 non-null float64 4 male_population 5420 non-null float64 5 female_population 5420 non-null float64 6 total_hhd 5420 non-null float64 7 total_grd_and_pg_in_village 5420 non-null float64 8 total_shg 5420 non-null float64 9 total_no_of_lactating_mothers 5420 non-null float64 10 total_no_of_lactating_mothers_re 5420 non-null float64 11 total_childs_aged_0_to_3_years 5420 non-null float64 12 total_childs_aged_0_to_3_years_i 5420 non-null float64 13 percent_immunized_child 5420 non-null float64 14 percent_grd 5420 non-null float64 15 percent_shg 5420 non-null float64 16 percent_icds_women 5420 non-null float64 dtypes: float64(14), int64(3) memory usage: 762.2 KB
Index(['pc11_state_id', 'pc11_district_id', 'pc11_subdistrict_id',
'total_population', 'male_population', 'female_population', 'total_hhd',
'total_grd_and_pg_in_village', 'total_shg',
'total_no_of_lactating_mothers', 'total_no_of_lactating_mothers_re',
'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i',
'percent_immunized_child', 'percent_grd', 'percent_shg',
'percent_icds_women'],
dtype='object')
As you can see, the column 'pc11_subdistrict_id' of the GDF is stored as object, while in subset_df (from Part 02) it is stored as int. We will convert both to float, in case we need to merge other data in the future. Float is more 'flexible' than integer (for example, a float can hold decimals).
gdf.pc11_subdistrict_id = gdf.pc11_subdistrict_id.astype(float)
subset_df.pc11_subdistrict_id = subset_df.pc11_subdistrict_id.astype(float)
# Name the merged variable as geo_pop
geo_pop = gdf.merge(subset_df, on='pc11_subdistrict_id')
geo_pop.head()
| pc11_state_id_x | pc11_district_id_x | pc11_subdistrict_id | subdistrict_name | geometry | pc11_state_id_y | pc11_district_id_y | total_population | male_population | female_population | ... | total_grd_and_pg_in_village | total_shg | total_no_of_lactating_mothers | total_no_of_lactating_mothers_re | total_childs_aged_0_to_3_years | total_childs_aged_0_to_3_years_i | percent_immunized_child | percent_grd | percent_shg | percent_icds_women | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 468 | 3722.0 | Lakhpat | MULTIPOLYGON (((68.39398 23.44476, 68.39264 23... | 24 | 468 | 62117.708950 | 32088.750366 | 30025.930963 | ... | 394.599993 | 566.165208 | 227.071608 | 204.869050 | 3082.118619 | 2619.901747 | 85.0 | 0.6 | 0.00911 | 90.2 |
| 1 | 24 | 468 | 3723.0 | Rapar | MULTIPOLYGON (((70.53440 23.46489, 70.53222 23... | 24 | 468 | 201902.906291 | 105105.320376 | 96796.567185 | ... | 2531.541403 | 339.236735 | 930.099517 | 861.844679 | 6842.802254 | 4941.854062 | 72.2 | 1.3 | 0.00168 | 92.7 |
| 2 | 24 | 468 | 3724.0 | Bhachau | MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... | 24 | 468 | 176504.000000 | 92678.000000 | 83460.000000 | ... | 2577.000000 | 234.000000 | 952.000000 | 790.000000 | 6897.000000 | 4373.000000 | 63.4 | 1.5 | 0.00133 | 83.0 |
| 3 | 24 | 468 | 3725.0 | Anjar | POLYGON ((70.23631 23.40849, 70.23631 23.40545... | 24 | 468 | 150105.112093 | 79896.579777 | 70092.396031 | ... | 3643.775958 | 491.505351 | 1147.882751 | 1027.598741 | 5673.050161 | 4004.627988 | 70.6 | 2.4 | 0.00327 | 89.5 |
| 4 | 24 | 468 | 3726.0 | Bhuj | POLYGON ((69.78433 23.99110, 69.79143 23.98750... | 24 | 468 | 238834.277866 | 119211.778880 | 113669.738389 | ... | 2819.277177 | 230.560895 | 629.484862 | 570.504168 | 7436.929336 | 5231.051377 | 70.3 | 1.2 | 0.00097 | 90.6 |
5 rows × 21 columns
# Because pc11_state_id_x = pc11_state_id_y and pc11_district_id_x = pc11_district_id_y,
# we will delete the _y columns
# and rename pc11_state_id_x to pc11_state_id and pc11_district_id_x to pc11_district_id
geo_pop = geo_pop.drop(['pc11_state_id_y', 'pc11_district_id_y'], axis=1)
# Rename columns
geo_pop = geo_pop.rename(columns={'pc11_state_id_x': 'pc11_state_id', 'pc11_district_id_x': 'pc11_district_id'})
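As a side note, merging on every shared key column at once (after aligning their dtypes, as was done for pc11_subdistrict_id) avoids the duplicated `_x`/`_y` columns that were cleaned up above. A minimal pandas sketch with toy stand-ins for the two dataframes:

```python
import pandas as pd

# Toy stand-ins for gdf and subset_df, sharing two key columns
left = pd.DataFrame({"pc11_state_id": [24, 24],
                     "pc11_subdistrict_id": [3722.0, 3723.0],
                     "subdistrict_name": ["Lakhpat", "Rapar"]})
right = pd.DataFrame({"pc11_state_id": [24, 24],
                      "pc11_subdistrict_id": [3722.0, 3723.0],
                      "total_population": [62117.7, 201902.9]})

# Passing a list to `on` keeps a single copy of each key column
merged = left.merge(right, on=["pc11_state_id", "pc11_subdistrict_id"])
print(list(merged.columns))
# -> ['pc11_state_id', 'pc11_subdistrict_id', 'subdistrict_name', 'total_population']
```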
# Check if the crs is lost. It turns out the crs is not lost; it still uses EPSG 4326. Seems fine, right?? :D
geo_pop.crs
<Geographic 2D CRS: EPSG:4326> Name: WGS 84 Axis Info [ellipsoidal]: - Lat[north]: Geodetic latitude (degree) - Lon[east]: Geodetic longitude (degree) Area of Use: - name: World. - bounds: (-180.0, -90.0, 180.0, 90.0) Datum: World Geodetic System 1984 ensemble - Ellipsoid: WGS 84 - Prime Meridian: Greenwich
Below, we will also merge "district.gpkg" with "subset_df". This is part of the "non-linear" process: in some code further below, when determining the spatial weights, there were far too many islands (when using the geo_pop above)! I think this is caused by the "subdistrict.gpkg" data. So, for convenience and better data quality, I will also merge "district.gpkg" with "subset_df".
When islands are not surrounded by similar values, they can disrupt the spatial autocorrelation patterns that are essential for many spatial analysis techniques.
mapdist_path = './data/shrug-pc11dist-poly-gpkg/district.gpkg'
map_dist = gpd.read_file(mapdist_path)
Since the GPKG data above is at the district level, while "subset_df" is at the subdistrict level, we need to create a pivot table containing the sum of each column (total population, total children, etc.).
columns_to_sum = ['total_population', 'male_population', 'female_population', 'total_hhd',
'total_grd_and_pg_in_village', 'total_shg',
'total_no_of_lactating_mothers', 'total_no_of_lactating_mothers_re',
'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i']
# Create a pivot table
pivot_table_df = subset_df.pivot_table(values=columns_to_sum, index='pc11_district_id', aggfunc='sum')
pivot_table_df
| female_population | male_population | total_childs_aged_0_to_3_years | total_childs_aged_0_to_3_years_i | total_grd_and_pg_in_village | total_hhd | total_no_of_lactating_mothers | total_no_of_lactating_mothers_re | total_population | total_shg | |
|---|---|---|---|---|---|---|---|---|---|---|
| pc11_district_id | ||||||||||
| 1 | 23523.138329 | 26510.879626 | 2761.074569 | 2590.694210 | 704.340235 | 8981.478930 | 505.056064 | 479.194760 | 5.004162e+04 | 1.521253 |
| 2 | 360076.838747 | 392652.138478 | 29465.549515 | 19313.974815 | 33116.408459 | 124419.391360 | 4322.707964 | 3710.070887 | 7.551146e+05 | 1291.782192 |
| 3 | 45598.474752 | 47615.448440 | 2837.299473 | 2422.499434 | 3636.436542 | 16684.780804 | 791.205308 | 756.308394 | 9.321936e+04 | 350.046346 |
| 4 | 78150.007679 | 78037.294461 | 7344.751470 | 5050.990314 | 4134.530494 | 23630.458275 | 1106.910964 | 961.424367 | 1.760563e+05 | 653.763285 |
| 5 | 230263.244779 | 240528.135141 | 18368.934333 | 11508.827636 | 11365.975014 | 92762.747369 | 5829.314660 | 4037.745077 | 4.717603e+05 | 1120.292371 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 632 | 450007.906654 | 447742.145346 | 18915.735027 | 15588.340247 | 40968.391785 | 289061.923978 | 4448.443895 | 4225.381399 | 9.028997e+05 | 4561.162494 |
| 633 | 596281.419949 | 595956.683533 | 23808.052542 | 19714.351280 | 67835.559509 | 391963.629490 | 4917.232443 | 4590.071292 | 1.195993e+06 | 4454.343346 |
| 638 | 3154.866472 | 3307.155473 | 257.055194 | 213.852640 | 70.204149 | 1883.631334 | 22.681341 | 22.681341 | 6.462022e+03 | 35.642107 |
| 639 | 54314.627363 | 57265.113737 | 3669.293400 | 3063.362231 | 5526.490359 | 29831.195841 | 662.187411 | 636.354232 | 1.116362e+05 | 743.045473 |
| 640 | 53502.795148 | 58770.591779 | 3692.091661 | 3419.911097 | 4963.049693 | 26783.402756 | 739.984115 | 678.350943 | 1.123320e+05 | 725.685726 |
613 rows × 10 columns
Merge the two datasets (GPKG and pivot_table) using the same steps as for the subdistrict above.
# Convert the key columns to float
map_dist.pc11_district_id = map_dist.pc11_district_id.astype(float)
geo_pop.pc11_district_id = geo_pop.pc11_district_id.astype(float)
pivot_table_df.index = pivot_table_df.index.astype(float)
# Next, merge using the indices
geo_pop_district = map_dist.merge(pivot_table_df, left_on='pc11_district_id', right_index=True)
geo_pop_district.head()
| pc11_state_id | pc11_district_id | district_name | geometry | female_population | male_population | total_childs_aged_0_to_3_years | total_childs_aged_0_to_3_years_i | total_grd_and_pg_in_village | total_hhd | total_no_of_lactating_mothers | total_no_of_lactating_mothers_re | total_population | total_shg | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 468.0 | Kachchh | MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... | 6.918732e+05 | 7.659721e+05 | 52789.062307 | 36124.140136 | 29509.874064 | 332446.747616 | 7194.496554 | 6452.874545 | 1.465846e+06 | 5262.200653 |
| 1 | 24 | 469.0 | Banas Kantha | MULTIPOLYGON (((71.24964 24.20926, 71.24207 24... | 1.490841e+06 | 1.603543e+06 | 137018.185689 | 93865.516320 | 103408.588403 | 607873.050328 | 19340.773617 | 16878.197941 | 3.273774e+06 | 13546.601715 |
| 2 | 24 | 470.0 | Patan | MULTIPOLYGON (((71.42507 23.96967, 71.42497 23... | 5.204153e+05 | 5.651527e+05 | 49457.172519 | 30594.736482 | 64222.702638 | 258319.762048 | 9340.013570 | 7797.987739 | 1.089937e+06 | 14351.448633 |
| 3 | 24 | 471.0 | Mahesana | POLYGON ((72.79975 24.07615, 72.80022 24.07529... | 7.639744e+05 | 8.327545e+05 | 59425.341078 | 49149.072387 | 123971.352675 | 380288.478687 | 8296.862992 | 6994.736144 | 1.617866e+06 | 15483.493324 |
| 4 | 24 | 472.0 | Sabar Kantha | POLYGON ((73.14784 24.47759, 73.14773 24.47410... | 1.096367e+06 | 1.183334e+06 | 87560.815945 | 66651.606962 | 152636.838216 | 582154.513255 | 16705.055796 | 14374.734389 | 2.320364e+06 | 20658.175179 |
Also, do not forget to calculate the needed variables in Hypothesis!
# No 01
geo_pop_district['percent_immunized_child'] = (geo_pop_district['total_childs_aged_0_to_3_years_i'] / geo_pop_district['total_childs_aged_0_to_3_years']).round(3) * 100
# No 02
geo_pop_district['percent_grd'] = (geo_pop_district['total_grd_and_pg_in_village'] / geo_pop_district['total_population']).round(3) * 100
# No 03
geo_pop_district['percent_shg'] = (geo_pop_district['total_shg'] / geo_pop_district['total_population']).round(5)
# No 04
geo_pop_district['percent_icds_women'] = (geo_pop_district['total_no_of_lactating_mothers_re'] / geo_pop_district['total_no_of_lactating_mothers']).round(3) * 100
Finally, I think it is also a good idea to put the "district name" into the "geo_pop" data, because it contains only subdistrict names. We can look it up (VLOOKUP-style) from "geo_pop_district".
# Cast the key column in place (assigning from map_dist could misalign rows, since the merge may have dropped some)
geo_pop_district.pc11_district_id = geo_pop_district.pc11_district_id.astype(float)
pivot_table_df.index = pivot_table_df.index.astype(float)
# Merge based on pc11_district_id
geo_pop = pd.merge(geo_pop, geo_pop_district[['pc11_district_id', 'district_name']], on='pc11_district_id', how='left')
2. Spatial Visualization
Let us see a glimpse of what India looks like!
# Setup figure and axis
f, ax = plt.subplots(1)
# Add layer of polygons on the axis
gdf.plot(ax=ax)
# Add figure title
f.suptitle('India Map')
# Display
plt.axis('equal')
# Add labels to axes
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()
Graphical Excellence
We can also add a basemap from CTX (contextily), to make the figure prettier and easier to relate to a real atlas map.
# Load subdistrict geometries, using the pc11_subdistrict_id code as index
map = gpd.read_file(map_path).set_index('pc11_subdistrict_id', drop=False)
ax = map.plot(figsize=(6, 6), alpha=0.3, color='red');
# Add background map, expressing target CRS so the basemap can be
# reprojected (warped)
ctx.add_basemap(ax, crs=map.crs)
ax.set_title('Map of India')
Text(0.5, 1.0, 'Map of India')
Graphical Excellence
Let us see whether, geographically (by map), the top 1000 subdistricts (representing roughly 20%) on each of our variables are close to each other. We will investigate: immunization of children vs total lactating mothers under ICDS (total_no_of_lactating_mothers_re).
The purpose of this top-1000 cut is to give a glimpse (like head and tail) of the dataset.
number_cut = 1000
highest_imun = geo_pop.sort_values('percent_immunized_child',ascending=False).head(number_cut)
highest_shgs = geo_pop.sort_values('percent_shg',ascending=False).head(number_cut)
highest_grd = geo_pop.sort_values('percent_grd',ascending=False).head(number_cut)
highest_ICDS = geo_pop.sort_values('percent_icds_women',ascending=False).head(number_cut)
f, ax = plt.subplots(1, figsize=(6, 6))
background = geo_pop.plot(facecolor='blue', linewidth=0.025, ax=ax, alpha=0.2)
highest_ICDS.plot(alpha=0.9, facecolor='green', linewidth=0, ax=ax)
highest_imun.plot(alpha=0.7, facecolor='red', linewidth=0, ax=ax)
ax.set_axis_off()
f.suptitle('1000 Subdistricts with Highest Number of Lactating Mother (ICDS) and Children with Immunization in India')
# Create custom legend entries
legend_elements = [
Line2D([0], [0], color='blue', alpha=0.2, lw=2, label='All Areas'),
Line2D([0], [0], marker='o', color='green', label='Highest % Lact. Mother w ICDS'),
Line2D([0], [0], marker='o', color='red', label='Highest % Child w Immunization'),
]
ax.legend(handles=legend_elements)
plt.axis('equal')
plt.show()
Graphical Excellence
From the picture above we can see that the most strongly correlated variables (explained in the previous heat map) are also reflected in the map: subdistricts with a high number of lactating mothers receiving ICDS tend to lie near areas with high levels of child immunization.
We will see this more clearly using choropleths.
3. Create Choropleths
The choropleths I will make use the "Quantiles" classification option. I chose this because, as you can see in Part 02, most of the data are right-skewed (positive skew), more or less resembling a Poisson distribution with a low lambda. Quantile classification is suitable for skewed distributions and helps handle outliers: it places an approximately equal number of observations in each interval, which gives a good representation.
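The idea behind quantile classification can be sketched with plain pandas (`qcut`) on a hypothetical skewed series, before applying mapclassify to the real column:

```python
import pandas as pd

skewed = pd.Series([1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89])  # toy right-skewed data

# qcut places an approximately equal number of observations in each bin,
# which is what mapclassify.Quantiles does for the choropleth classes
bins = pd.qcut(skewed, q=4)
counts = bins.value_counts()
```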
num_division = 11 # Number of quantile classes to divide into
classi = mapclassify.Quantiles(geo_pop['percent_immunized_child'], k=num_division)
classi
Quantiles
Interval Count
------------------------
[ 0.00, 47.30] | 497
( 47.30, 57.12] | 494
( 57.12, 63.40] | 490
( 63.40, 68.20] | 511
( 68.20, 72.50] | 490
( 72.50, 76.20] | 498
( 76.20, 80.10] | 492
( 80.10, 83.70] | 494
( 83.70, 87.60] | 498
( 87.60, 92.50] | 491
( 92.50, 100.00] | 492
# Set up the figure to explain the Quantiles that I made
f, ax = plt.subplots(1, figsize=(8, 5))
# Plot the kernel density estimation (KDE)
sns.kdeplot(geo_pop['percent_immunized_child'], fill=True)
# Add a blue tick for every value at the bottom of the plot (rugs)
sns.rugplot(geo_pop['percent_immunized_child'], alpha=0.5)
# Loop over each break point and plot a vertical red line
for cut in classi.bins:
plt.axvline(cut, color='red', linewidth=0.75)
# Display image
ax.set_title('KDE for Immunized Children (0-3) Variable')
plt.show()
Graph Excellence
# Create Choropleths using the 'cividis' colormap
geo_pop.plot(column='percent_immunized_child', scheme='QUANTILES', alpha=1, k=num_division, \
cmap='cividis', # Use cividis colormap
edgecolor='w', linewidth=0.1, figsize=(8, 8))
# Add title with proper centering
plt.title('Choropleths of Immunized Children (0-3) in India, per Subdistricts', pad=20, loc='center')
# Add colorbar with proper positioning
cax = plt.axes([0.95, 0.1, 0.03, 0.8]) # [x, y, width, height]
sm = plt.cm.ScalarMappable(cmap='cividis', norm=plt.Normalize(vmin=geo_pop['percent_immunized_child'].min(), vmax=geo_pop['percent_immunized_child'].max()))
sm._A = [] # empty array for the data range
cbar = plt.colorbar(sm, cax=cax)
# Set colorbar label
cbar.set_label('Percentage of children aged 0-3 years immunized')
# Show the plot
plt.show()
Graph Excellence
We can see that some subdistricts (and districts) have high immunization numbers while others have low ones. We will do a case study on three districts below and their neighbours (identified via the maps at https://www.mapsofindia.com/districts-india/, and cross-checked using Queen neighbours during the spatial weight calculation). I am also considering inclusivity.
Where do those three districts come from? I analyzed and zoomed into the top and bottom districts by % of immunized children.
# Here, I will show you Top 10
n_top_districts = 10
# Set the seaborn color palette to a colorblind-friendly palette
sns.set_palette("colorblind")
# Sort the DataFrame by 'percent_immunized_child' in descending order
sorted_df = geo_pop_district.sort_values(by='percent_immunized_child', ascending=False)
# Select the top districts based on the variable n_top_districts
top_districts = sorted_df.head(n_top_districts)
# Create the bar chart for 'percent_immunized_child'
fig, ax1 = plt.subplots(figsize=(12, 6))
color = sns.color_palette()[0] # Use the first color from the palette
ax1.set_xlabel('District')
ax1.set_ylabel('Percent Immunized Child', color=color)
ax1.bar(top_districts['district_name'], top_districts['percent_immunized_child'], color=color, label='Percent Immunized Child')
ax1.tick_params(axis='y', labelcolor=color)
ax1.set_ylim(0, 100) # Set the y-axis limit for 'Percent Immunized Child'
# Create the line chart for 'percent_icds_women'
ax2 = ax1.twinx()
color = sns.color_palette()[1] # Use the second color from the palette
ax2.set_ylabel('Percent ICDS Women', color=color)
ax2.plot(top_districts['district_name'], top_districts['percent_icds_women'], color=color, marker='o', linestyle='--', label='Percent ICDS Women')
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylim(0, 100) # Set the y-axis limit for 'Percent ICDS Women'
plt.title(f'{n_top_districts} Districts with Highest Percentage of Immunized Child and ICDS Women')
fig.tight_layout()
fig.legend(loc='upper left', bbox_to_anchor=(0.8, 0.2)) # Set the legend to a custom position
plt.show()
Graph Excellence
As we can see above, there is an outlier case: in Janjgir, the percentage of immunized children is high but the number of women receiving ICDS is very low! After checking in detail, it turns out this is an outlier because the population of Janjgir is very low. Thus, we will further subset out districts with a population under 10,000.
# Sort the DataFrame by 'total_population' in ascending order
geo_pop_district.sort_values(by='total_population').head(5)
# To make the table easier to read, reorder the columns with 'total_population' last
desired_column_order = ['pc11_state_id', 'pc11_district_id', 'district_name', 'geometry',
'female_population', 'male_population',
'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i',
'total_grd_and_pg_in_village', 'total_hhd',
'total_no_of_lactating_mothers', 'total_no_of_lactating_mothers_re',
'total_shg', 'percent_immunized_child',
'percent_grd', 'percent_shg', 'percent_icds_women', 'total_population']
# Reorder the columns
geo_pop_district = geo_pop_district[desired_column_order]
We can also see here that some districts indeed have a very small total population.
selected_columns = ['pc11_state_id', 'pc11_district_id', 'district_name', 'geometry',
'total_population', 'percent_immunized_child']
geo_pop_district[selected_columns].sort_values(by='total_population').head()
| pc11_state_id | pc11_district_id | district_name | geometry | total_population | percent_immunized_child | |
|---|---|---|---|---|---|---|
| 423 | 12 | 257.0 | Dibang Valley | POLYGON ((96.11495 29.39999, 96.14150 29.36881... | 292.000000 | 62.5 |
| 61 | 22 | 405.0 | Janjgir - Champa | POLYGON ((82.64385 22.21103, 82.64583 22.20764... | 2038.000000 | 100.0 |
| 62 | 22 | 406.0 | Bilaspur | POLYGON ((82.08579 23.11360, 82.08697 23.11342... | 2363.000000 | 10.3 |
| 637 | 35 | 638.0 | Nicobars | MULTIPOLYGON (((93.71364 7.21016, 93.71297 7.2... | 6462.021944 | 83.2 |
| 451 | 15 | 285.0 | Serchhip | POLYGON ((92.89217 23.56077, 92.89312 23.55949... | 15921.602747 | 96.7 |
# Filter rows where 'total_population' is greater than or equal to 10000
geo_pop_district = geo_pop_district[geo_pop_district['total_population'] >= 10000]
To make sure the deleted data are insignificant, we can see how spread out the data are using a box plot. Below, the bulk of the data is already covered by the whiskers' span, hence districts below 10,000 can be deleted.
# Set up the figure and axes
plt.figure(figsize=(8, 4))
# Create a box plot
sns.boxplot(x='total_population', data=geo_pop_district, color='skyblue')
# Add title and labels
plt.title('Total Population by District')
plt.xlabel('Total Population')
# Show the plot
plt.show()
Next, check again what districts with highest and lowest immunization percentage.
n_top_districts = 10
# Set the seaborn color palette to a colorblind-friendly palette
sns.set_palette("colorblind")
# Sort the DataFrame by 'percent_immunized_child' in descending order
sorted_df = geo_pop_district.sort_values(by='percent_immunized_child', ascending=False)
# Select the top districts based on the variable n_top_districts
top_districts = sorted_df.head(n_top_districts)
# Create the bar chart for 'percent_immunized_child'
fig, ax1 = plt.subplots(figsize=(12, 6))
color = sns.color_palette()[0] # Use the first color from the palette
ax1.set_xlabel('District')
ax1.set_ylabel('Percent Immunized Child', color=color)
ax1.bar(top_districts['district_name'], top_districts['percent_immunized_child'], color=color, label='Percent Immunized Child')
ax1.tick_params(axis='y', labelcolor=color)
ax1.set_ylim(50, 100) # Set the y-axis limit for 'Percent Immunized Child'
# Create the line chart for 'percent_icds_women'
ax2 = ax1.twinx()
color = sns.color_palette()[1] # Use the second color from the palette
ax2.set_ylabel('Percent ICDS Women', color=color)
ax2.plot(top_districts['district_name'], top_districts['percent_icds_women'], color=color, marker='o', linestyle='--', label='Percent ICDS Women')
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylim(50, 100) # Set the y-axis limit for 'Percent ICDS Women'
plt.title(f'{n_top_districts} Districts with Highest Percentage of Immunized Child and ICDS Women')
fig.tight_layout()
fig.legend(loc='upper left', bbox_to_anchor=(0.8, 0.2)) # Set the legend to a custom position
plt.show()
Graph Excellence
Also, do the same for the lowest level
n_bottom_districts = 10
# Set the seaborn color palette to a colorblind-friendly palette
sns.set_palette("colorblind")
# Sort the DataFrame by 'percent_immunized_child' in ascending order to get the bottom districts
sorted_df_bottom = geo_pop_district.sort_values(by='percent_immunized_child', ascending=True)
# Select the bottom districts based on the variable n_bottom_districts
bottom_districts = sorted_df_bottom.head(n_bottom_districts)
# Create the bar chart for 'percent_immunized_child'
fig, ax1 = plt.subplots(figsize=(12, 6))
color = sns.color_palette()[0] # Use the first color from the palette
ax1.set_xlabel('District')
ax1.set_ylabel('Percent Immunized Child', color=color)
ax1.bar(bottom_districts['district_name'], bottom_districts['percent_immunized_child'], color=color, label='Percent Immunized Child')
ax1.tick_params(axis='y', labelcolor=color)
ax1.set_ylim(0, 100) # Set the y-axis limit for 'Percent Immunized Child'
# Create the line chart for 'percent_icds_women'
ax2 = ax1.twinx()
color = sns.color_palette()[1] # Use the second color from the palette
ax2.set_ylabel('Percent ICDS Women', color=color)
ax2.plot(bottom_districts['district_name'], bottom_districts['percent_icds_women'], color=color, marker='o', linestyle='--', label='Percent ICDS Women')
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylim(0, 100) # Set the y-axis limit for 'Percent ICDS Women'
plt.title(f'{n_bottom_districts} Districts with Lowest Percentage of Immunized Child and ICDS Women')
fig.tight_layout()
fig.legend(loc='upper left', bbox_to_anchor=(0.8, 0.2)) # Set the legend to a custom position
plt.show()
Graph Excellence
After selecting the three representative districts, we will take a deep dive and return to testing H4 in each of them.
First district analysis + AMENITIES related to healthcare.
Amenities can be retrieved from OSMnx; after reading the documentation, the relevant tags are "building": "hospital" and "amenity": ["hospital", "clinic", "doctors", "social_facility"].
Pathanamthitta District
# Change 'percentage_immunized' to 'percent_immunized_child' in geo_pop dataframe
geo_pop = geo_pop.rename(columns={'percentage_immunized': 'percent_immunized_child'})
# Select a district with high percentage of immunization also with its neighbour
# To know the neighbour, I use maps here https://www.mapsofindia.com/districts-india/
selected_districts = ['Pathanamthitta', 'Alappuzha', 'Kottayam', 'Kollam', 'Idukki', 'Ernakulam', 'Thrissur', 'Theni', 'Tenkasi', 'Tirunelveli']
area1 = selected_districts # I will put the above districts into a different array so that I can combine all selected districts later!
filtered_geo_pop = geo_pop[geo_pop['district_name'].isin(selected_districts)]
# Input data from OSMX
# I use bbox that I input manually after knowing the boundary of the districts and neighbours
north, south, east, west = 10.4, 8.8, 77.4, 76.2
health_facil = ox.features_from_bbox(north, south, east, west, tags={"building": "hospital", "amenity": ["hospital", "clinic","doctors","social_facility"]})
#water = ox.features_from_bbox(north, south, east, west, tags={'natural': 'water'})
# Convert the features into a GeoDataFrame (set the CRS explicitly so later spatial joins do not hit a CRS mismatch)
health_facil_gdf = gpd.GeoDataFrame.from_features(health_facil, crs='EPSG:4326')
num_division = 20 # Divide into how many sections
classi = mapclassify.Quantiles(filtered_geo_pop['percent_immunized_child'], k=num_division)
fig, ax = plt.subplots(figsize=(8, 8))
filtered_geo_pop.plot(ax=ax, column='percent_immunized_child', scheme='QUANTILES', alpha=1, k=num_division, \
cmap='viridis', # Use viridis colormap
edgecolor='w', linewidth=0.1, figsize=(8, 8))
# Add colorbar
cax = plt.axes([0.95, 0.1, 0.03, 0.8]) # [x, y, width, height]
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(vmin=geo_pop['percent_immunized_child'].min(), vmax=geo_pop['percent_immunized_child'].max()))
sm._A = [] # empty array for the data range
cbar = plt.colorbar(sm, cax=cax)
# Set colorbar label
cbar.set_label('Percentage of Children (0-3 years) with Immunization')
health_facil_gdf.plot(ax=ax, color='red', markersize=10, label='Hospitals and Health Facility') # Use dark red color
#water_gdf.plot(ax=ax, color='blue')
# Add legend
ax.legend()
ax.set_ylim(south-0.2, north+0.2)
ax.set_xlim(west-0.2, east+0.2)
ctx.add_basemap(ax, crs=geo_pop.crs)
ax.set_title('Immunization Percentage in Pathanamthitta District and surroundings')
# Show the plot
plt.show()
Graph Excellence
Evaluation
We can see in this district that areas with both high and low percentages of immunized children have hospitals and health facilities spread across them (not necessarily concentrated in one area).
selected_districts = ['Anjaw', 'Lohit', 'Changlang', 'Namsai']
area2 = selected_districts # I will put the above districts into a different array so that I can combine all selected districts!
filtered_geo_pop = geo_pop[geo_pop['district_name'].isin(selected_districts)]
north, south, east, west = 28.4, 27, 97.5, 95.5
health_facil = ox.features_from_bbox(north, south, east, west, tags={"building": "hospital", "amenity": ["hospital", "clinic", "doctors", "social_facility"]})
# Convert the features into a GeoDataFrame (set the CRS explicitly so later spatial joins do not hit a CRS mismatch)
health_facil_gdf = gpd.GeoDataFrame.from_features(health_facil, crs='EPSG:4326')
num_division = 20 # Divide into how many sections
classi = mapclassify.Quantiles(filtered_geo_pop['percent_immunized_child'], k=num_division)
fig, ax = plt.subplots(figsize=(10, 10))
filtered_geo_pop.plot(ax=ax, column='percent_immunized_child', scheme='QUANTILES', alpha=1, k=num_division, \
cmap='viridis', # Use viridis colormap
edgecolor='w', linewidth=0.1, figsize=(8, 8))
# Add colorbar
cax = plt.axes([0.95, 0.1, 0.03, 0.8]) # [x, y, width, height]
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(vmin=geo_pop['percent_immunized_child'].min(), vmax=geo_pop['percent_immunized_child'].max()))
sm._A = [] # empty array for the data range
cbar = plt.colorbar(sm, cax=cax)
# Set colorbar label
cbar.set_label('Percentage of Children (0-3 years) with Immunization')
health_facil_gdf.plot(ax=ax, color='red', markersize=10, label='Hospitals and Health Facility') # Use dark red color
# Add legend
ax.legend()
ctx.add_basemap(ax, crs=geo_pop.crs)
ax.set_title('Immunization Percentage in Anjaw District and surroundings')
# Show the plot
plt.show()
Graph Excellence
Evaluation
We can also see that this district has very few health-related facilities. That is expected, since we chose it as one of the districts with the lowest percentage of immunized children aged 0-3 years. However, as the maps clearly show, districts with both high and low immunization percentages can have very few health facilities.
selected_districts = ['Bhandara', 'Nagpur', 'Gondiya', 'Seoni', 'Balaghat', 'Chhindwara', 'Garhchirolo', 'Chandrapur', 'Wardha']
area3 = selected_districts # I will put the above districts into a different array so that I can combine all selected districts!
filtered_geo_pop = geo_pop[geo_pop['district_name'].isin(selected_districts)]
north, south, east, west = 22.7, 19.5, 81, 78.5
health_facil = ox.features_from_bbox(north, south, east, west, tags={"building": "hospital", "amenity": ["hospital", "clinic", "doctors", "social_facility"]})
# Convert the features into a GeoDataFrame (set the CRS explicitly so later spatial joins do not hit a CRS mismatch)
health_facil_gdf = gpd.GeoDataFrame.from_features(health_facil, crs='EPSG:4326')
num_division = 20 # Divide into how many sections
classi = mapclassify.Quantiles(filtered_geo_pop['percent_immunized_child'], k=num_division)
fig, ax = plt.subplots(figsize=(10, 10))
filtered_geo_pop.plot(ax=ax, column='percent_immunized_child', scheme='QUANTILES', alpha=1, k=num_division, \
cmap='viridis', # Use viridis colormap
edgecolor='w', linewidth=0.1, figsize=(8, 8))
# Add colorbar
cax = plt.axes([0.95, 0.1, 0.03, 0.8]) # [x, y, width, height]
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(vmin=geo_pop['percent_immunized_child'].min(), vmax=geo_pop['percent_immunized_child'].max()))
sm._A = [] # empty array for the data range
cbar = plt.colorbar(sm, cax=cax)
# Set colorbar label
cbar.set_label('Percentage of Children (0-3 years) with Immunization')
health_facil_gdf.plot(ax=ax, color='red', markersize=10, label='Hospitals and Health Facility') # Use dark red color
# Add legend
ax.legend()
ctx.add_basemap(ax, crs=geo_pop.crs)
# Show the plot
ax.set_title('Immunization Percentage in Bhandara District and surroundings')
plt.show()
Graph Excellence
Evaluation
After seeing these three representative districts, can we conclude that H4 is rejected? The health facilities are not concentrated only in districts with a low percentage of immunization. But let us not rely on our eyes alone; let us check the data and use the Pearson correlation again.
From the graphs above, we need to add a new column to the geo_pop dataframe.
Use the steps below:
# Join each hospital+health facility with the corresponding subdistrict
# Here, I use "inner" (use intersection of the geometry) and "within" (choose that is inside)
# The data that will be joined are: health facility gdf and the main dataframe (filtered based on districts)
selected_districts = area1 + area2 + area3
filtered_geo_pop = geo_pop[geo_pop['district_name'].isin(selected_districts)]
joined_data = gpd.sjoin(health_facil_gdf, filtered_geo_pop, how="inner", predicate="within")
# Calculate the number of hospitals per area
hospital_counts = joined_data.groupby('subdistrict_name').size().reset_index(name='health_facil_count')
# Calculate the area of each subdistrict
calc_area = filtered_geo_pop[['subdistrict_name','district_name','geometry','percent_immunized_child']].copy()
calc_area = calc_area.to_crs('EPSG:32633')
calc_area['total_area'] = calc_area.geometry.area / 10**6
# Merge hospital_counts and district_areas to create the new array
health_facility_density_df = pd.merge(hospital_counts, calc_area[['district_name','subdistrict_name','percent_immunized_child','total_area','geometry']], on='subdistrict_name', how='outer')
health_facility_density_df = health_facility_density_df.fillna(0)
health_facility_density_df['health_facility/sq.km'] = health_facility_density_df['health_facil_count'] / health_facility_density_df['total_area']
column_order = ['district_name', 'subdistrict_name','health_facil_count', 'total_area','health_facility/sq.km','percent_immunized_child','geometry']
health_facility_density_df = health_facility_density_df[column_order]
health_facility_density_df.head()
|   | district_name | subdistrict_name | health_facil_count | total_area | health_facility/sq.km | percent_immunized_child | geometry |
|---|---|---|---|---|---|---|---|
| 0 | Chhindwara | Amarwara | 4.0 | 2994.261732 | 0.001336 | 70.6 | POLYGON ((8136551.045 4848193.458, 8136662.527... |
| 1 | Gondiya | Amgaon | 1.0 | 1105.078742 | 0.000905 | 71.0 | POLYGON ((8404800.713 4832755.934, 8404840.361... |
| 2 | Gondiya | Arjuni Morgaon | 3.0 | 3585.795810 | 0.000837 | 82.4 | POLYGON ((8447580.202 4719914.218, 8448069.824... |
| 3 | Balaghat | Baihar | 7.0 | 9699.919428 | 0.000722 | 74.2 | POLYGON ((8362487.172 5026853.348, 8362637.339... |
| 4 | Balaghat | Balaghat | 17.0 | 4077.470261 | 0.004169 | 66.0 | POLYGON ((8270767.000 4931971.672, 8271153.673... |
from scipy.stats import pearsonr
# Calculate correlation coefficient and p-value
correlation_coefficient, p_value = pearsonr(health_facility_density_df['health_facility/sq.km'], health_facility_density_df['percent_immunized_child'])
# Print results
print(f'Correlation coefficient: {correlation_coefficient}')
print(f'P-value: {p_value}')
Correlation coefficient: -0.045525608575021206 P-value: 0.5737801079884187
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='health_facility/sq.km', y='percent_immunized_child', data=health_facility_density_df)
plt.title('Health Facility Density vs. Percentage of Immunized Children')
plt.xlabel('Health Facility Density (per sq. km)')
plt.ylabel('Percentage Children (0-3) Immunized')
plt.xlim(0, 0.001) # Set x-axis limits
plt.grid(True)
plt.show()
Principle of excellence:
Reflection
As we can see above, the p-value > 5%, hence the correlation is not significant. Thus, there may be no relationship between the percentage of immunized children and the number of health-related facilities per sq. km (H4). H4 is rejected!
Critical Data Science Reflection
In this part, main activities are:
This part is mainly to test final Hypothesis, the H5.
H5: the level of immunized children (aged 0-3 years) is affected by neighbouring areas with a certain level of immunization. The immunization practices of one area (subdistrict) may influence the immunization behaviour of nearby areas. This stems from the idea that communities with higher immunization rates create a positive norm that influences neighbouring communities.
Main idea: to check how an area correlates with its neighbours in terms of the percentage of immunized children (0-3 years), we will run both Global and Local autocorrelation analyses.
1. Creating Spatial Weight
Before doing ESDA at both the Global and Local level, we need to create a spatial weights matrix. Here, we will use "queen" contiguity, which defines neighbours as units sharing a common boundary or vertex. Queen contiguity is used because it is less sensitive to irregular shapes (especially regions) in the spatial distribution of the units.
variables = 'pc11_district_id'
weight_df = geo_pop_district
# Check for duplicate entries
if weight_df[variables].duplicated().any():
# If duplicates exist, drop them or handle them as per your data requirements
weight_df = weight_df.drop_duplicates(subset=variables)
# Set the index
weight_df = weight_df.set_index(variables, drop=False)
# Create spatial weights matrix
w_queen = Queen.from_dataframe(weight_df, ids=variables)
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/libpysal/weights/weights.py:224: UserWarning: The weights matrix is not fully connected: There are 25 disconnected components. There are 5 islands with ids: 71.0, 88.0, 528.0, 639.0, 640.0. warnings.warn(message)
As you can see above, the number of islands is quite low, so we can ignore them in further exploration. There are 604 districts in total, which makes the 5 islands less than 1% of the data (ignoring them will not affect the analysis much).
w_queen.islands
[71.0, 88.0, 528.0, 639.0, 640.0]
# The index is already set to 'pc11_district_id'; just drop the islands
weight_df = weight_df.drop(w_queen.islands)
Once we have the set of local authorities that are not an island, we need to re-calculate the weights matrix:
w_queen = Queen.from_dataframe(weight_df, ids=variables)
w_queen.transform = 'R'
Also check the plot to see how the number of neighbours is distributed! As we can see in the graph below, the average number of neighbours per district is around 4.
queen_card = pd.Series(w_queen.cardinalities)
sns.histplot(queen_card, bins=10, kde=True)
plt.title('Distribution of Cardinalities for Queen Contiguity')
Text(0.5, 1.0, 'Distribution of Cardinalities for Queen Contiguity')
Principle of excellence:
2. Spatial Lag
Next, we will determine the Spatial Lag.
We will create some new variables:
weight_df['w_percent_immunized_child'] = weights.lag_spatial(w_queen, weight_df['percent_immunized_child'])
# Check whether it works or not :D
weight_df[['pc11_district_id', 'district_name','percent_immunized_child', 'w_percent_immunized_child']].head(5)
| pc11_district_id | district_name | percent_immunized_child | w_percent_immunized_child |
|---|---|---|---|
| 468.0 | Kachchh | 68.4 | 69.75 |
| 469.0 | Banas Kantha | 68.5 | 65.15 |
| 470.0 | Patan | 61.9 | 68.45 |
| 471.0 | Mahesana | 82.7 | 72.00 |
| 472.0 | Sabar Kantha | 76.1 | 72.60 |
For example, let us check the neighbours of one district:
w_queen.neighbors[468.0]
[477.0, 476.0, 469.0, 470.0]
Next, create the standardized variables (z-scores):
weight_df['percent_immunized_child_std'] = (weight_df['percent_immunized_child'] - weight_df['percent_immunized_child'].mean()) / weight_df['percent_immunized_child'].std()
weight_df['w_percent_immunized_child_std'] = weights.lag_spatial(w_queen, weight_df['percent_immunized_child_std'])
Moran Plot
Finally, we will see how the w_percent_immunized_child_std correlates with percent_immunized_child_std
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(6, 6))
# Plot values
sns.regplot(x='percent_immunized_child_std', y='w_percent_immunized_child_std', data=weight_df, ci=None)
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
# Display
plt.title('Moran Plot of Standardized Child Immunization Percentages')
plt.show()
Principle of excellence:
In the context of the percentage of immunized children, the Moran Plot can be interpreted as follows: districts display positive spatial autocorrelation in how children are immunized. This means that districts with a high percentage of immunization tend to be located near other districts where immunization percentages are also high, and vice versa.
Moran I
In order to calculate Moran's I in our dataset, we can call a specific function in PySAL directly:
mi = esda.Moran(weight_df['percent_immunized_child'], w_queen)
print("Moran I value = ")
mi.I
Moran I value =
0.586662400879172
moran_scatterplot(mi);
Principle of excellence:
Local Spatial autocorrelation
Moran's I provides an indication of overall clustering but doesn't pinpoint the specific locations of clusters. To identify the spatial distribution at a finer level, local measures of spatial autocorrelation are crucial. Unlike global measures that analyze the entire dataset, local measures operate on individual observations, offering detailed insights instead of summarizing the entire map.
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(6, 6))
# Plot values
sns.regplot(x='percent_immunized_child_std', y='w_percent_immunized_child_std', data=weight_df, ci=None)
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
plt.text(1.75, 0.5, "HH", fontsize=25)
plt.text(1.5, -1.5, "HL", fontsize=25)
plt.text(-2, 1, "LH", fontsize=25)
plt.text(-1.5, -2.5, "LL", fontsize=25)
# Display
plt.show()
Next, identify cases in which the comparison between the value of an observation and the average of its neighbours is either more similar (HH, LL) or more dissimilar (HL, LH) than we would expect from pure chance. To do so, we need LISA (Local Indicators of Spatial Association).
lisa = esda.Moran_Local(weight_df['percent_immunized_child'], w_queen)
All we need to pass is the variable of interest (the percentage of immunized children) and the spatial weights that describe the neighbourhood relations between the observations that make up the dataset.
Next, to make the map-making more straightforward, it is convenient to pull the results out and insert them in the main data table (weight_df).
# Break observations into significant or not
weight_df['significant'] = lisa.p_sim < 0.05
# Store the quadrant they belong to
weight_df['quadrant'] = lisa.q
Next, inspect the significance column of the dataframe, followed by the quadrant column.
weight_df['significant'].head(20)
pc11_district_id
468.0    False
469.0    False
470.0    False
471.0    False
472.0    False
473.0    False
474.0    False
475.0    False
476.0    False
477.0    False
478.0    False
479.0    False
480.0    False
481.0    False
482.0    False
483.0    False
484.0    False
485.0    False
486.0    False
487.0     True
Name: significant, dtype: bool
weight_df['quadrant'].head()
pc11_district_id
468.0    3
469.0    3
470.0    3
471.0    1
472.0    1
Name: quadrant, dtype: int64
The correspondence between the numbers in the variable and the actual quadrants is as follows:
HH (High-High): Explanation: Districts in the High-High quadrant have high immunization percentages and are surrounded by other districts with high immunization percentages. Conclusion: This indicates spatial clustering of high immunization rates, suggesting a positive spatial autocorrelation. High-performing districts are geographically close to other high-performing districts.
LL (Low-Low): Explanation: Districts in the Low-Low quadrant have low immunization percentages and are surrounded by other districts with low immunization percentages. Conclusion: This suggests spatial clustering of low immunization rates, which also indicates positive spatial autocorrelation (similar values cluster together). Districts with low immunization rates tend to be close to other districts with similarly low rates.
LH (Low-High): Explanation: Districts in the Low-High quadrant have low immunization percentages but are surrounded by districts with high immunization percentages. Conclusion: This pattern indicates spatial outliers where districts with low immunization rates are in close proximity to districts with high rates. It may be an area of concern or a unique spatial pattern.
HL (High-Low): Explanation: Districts in the High-Low quadrant have high immunization percentages but are surrounded by districts with low immunization percentages. Conclusion: Similar to LH, this pattern represents spatial outliers where districts with high immunization rates are in close proximity to districts with low rates. This, too, could be an area of concern or a unique spatial pattern.
Next, with these two columns (significant and quadrant), we can build a typical LISA cluster map:
lisa_cluster(lisa, weight_df);
Graph Excellence
# In more detailed graph
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(6, 6))
# Plot insignificant clusters
ns = weight_df.loc[weight_df['significant']==False, 'geometry']
ns.plot(ax=ax, color='k')
# Plot HH clusters
hh = weight_df.loc[(weight_df['quadrant']==1) & (weight_df['significant']==True), 'geometry']
hh.plot(ax=ax, color=sns.color_palette("coolwarm", as_cmap=True)(0.8)) # Red
# Plot LL clusters
ll = weight_df.loc[(weight_df['quadrant']==3) & (weight_df['significant']==True), 'geometry']
ll.plot(ax=ax, color=sns.color_palette("coolwarm", as_cmap=True)(0.2)) # Blue
# Plot LH clusters
lh = weight_df.loc[(weight_df['quadrant']==2) & (weight_df['significant']==True), 'geometry']
lh.plot(ax=ax, color=sns.color_palette("coolwarm", as_cmap=True)(0.35)) # Light blue
# Plot HL clusters
hl = weight_df.loc[(weight_df['quadrant']==4) & (weight_df['significant']==True), 'geometry']
hl.plot(ax=ax, color=sns.color_palette("coolwarm", as_cmap=True)(0.65)) # Light red
# Style and draw
f.suptitle('LISA for Immunization Percentage on Children 0-3 yrs in India', size=14)
f.set_facecolor('0.75')
ax.set_axis_off()
plt.show()
Graph Excellence
plot_local_autocorrelation(lisa, weight_df, 'percent_immunized_child');
Graph Excellence
Reflection
Regarding H5:
if mi.p_sim < 0.05:
    print("Spatial autocorrelation is statistically significant.")
    if mi.I > 0:
        print("Positive spatial autocorrelation (HH or LL).")
    else:
        print("Negative spatial autocorrelation (LH or HL).")
else:
    print("No statistically significant spatial autocorrelation detected.")
Spatial autocorrelation is statistically significant. Positive spatial autocorrelation (HH or LL).
After this long marathon through my story, I would like to thank Trivik and the TAs for answering my questions in the lab, and of course you for reading my story here!
Conclusion
Several hypotheses in this preliminary research have been tested, but all of them need to be evaluated with domain expertise (such as Trivik's) to fully remove possible biases. So far:
Final words: Some important variables that readers can use further to analyze the level or percentage of immunization on children age 0-3 years:
Reflection using Critical Data Science Framework
Important notes for users of this preliminary research, in the context of "Communicate Findings: transparency and accessibility of the results":